Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Bian, Wesley, Lin, Xiaofeng, Cheng, Guang

arXiv.org Artificial Intelligence

Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
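
The abstract does not detail the augmentation itself; below is a minimal Python sketch of mixup applied in a latent (encoder) space, assuming a standard Beta-distributed interpolation of paired features and soft targets. The encoder, decoder, and target handling are placeholders, not the authors' implementation.

import torch

def latent_mixup(z_a, z_b, y_a, y_b, alpha=0.4):
    """Mix two latent representations and their targets.

    Generic mixup sketch: which layer is mixed and how ASR targets are
    combined is not specified in the abstract, so this assumes soft targets.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    z_mixed = lam * z_a + (1.0 - lam) * z_b
    y_mixed = lam * y_a + (1.0 - lam) * y_b  # assumes soft / one-hot targets
    return z_mixed, y_mixed

# Hypothetical usage inside a training step:
# z_a, z_b = encoder(x_a), encoder(x_b)   # latent features of two utterances
# z, y = latent_mixup(z_a, z_b, y_a, y_b)
# loss = criterion(decoder(z), y)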



From Sound to Setting: AI-Based Equalizer Parameter Prediction for Piano Tone Replication

Yu, Song-Ze

arXiv.org Artificial Intelligence

This project presents an AI-based system for tone replication in music production, focusing on predicting EQ parameter settings directly from audio features. Unlike traditional audio-to-audio methods, our approach generates interpretable parameter values, such as EQ band gains, that musicians can further adjust in their workflow. Using a dataset of piano recordings with systematically varied EQ settings, we evaluate both regression and neural network models. Results show that our neural network model achieves highly accurate parameter predictions, with a mean squared error of 0.0216 on multi-band tasks. The proposed system enables practical, flexible, and automated tone matching for music producers, laying the foundation for future extensions to more complex audio effects.
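
As a rough illustration of the parameter-prediction setup, here is a minimal PyTorch sketch of a small regression network trained with mean squared error to map audio features to per-band EQ gains. The feature dimension, number of bands, and architecture are assumptions for illustration, not the paper's model.

import torch
import torch.nn as nn

# Hypothetical dimensions: 40 summary features per recording, 5 EQ band gains.
N_FEATURES, N_BANDS = 40, 5

class EQGainPredictor(nn.Module):
    """Small MLP mapping audio features to per-band EQ gain values."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, N_BANDS),      # regression head: one gain per band
        )

    def forward(self, x):
        return self.net(x)

model = EQGainPredictor()
criterion = nn.MSELoss()                 # the paper reports MSE on multi-band tasks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a synthetic batch of (features, target gains):
features = torch.randn(32, N_FEATURES)
target_gains = torch.randn(32, N_BANDS)
optimizer.zero_grad()
loss = criterion(model(features), target_gains)
loss.backward()
optimizer.step()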




Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

Mei, Katelyn Xiaoying, Choi, Anna Seo Gyeong, Schellmann, Hilke, Sloane, Mona, Koenecke, Allison

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
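
For reference, the Word Error Rate discussed above is the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. The short Python sketch below computes it from scratch and shows how a single text-standardization choice (stripping a disfluency) changes the score; it is only an illustration of the first pitfall, not the authors' auditing framework.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Two plausible standardizations of the same pair give different WERs:
ref = "I um went to the store"
hyp = "I went to the store"
print(wer(ref, hyp))                     # disfluency kept in reference: 1/6
print(wer(ref.replace(" um", ""), hyp))  # disfluency stripped: 0/5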


Hallucination Level of Artificial Intelligence Whisperer: Case Speech Recognizing Pantterinousut Rap Song

Horppu, Ismo, Ayala, Frederick, Gulbenkoglu, Erlin

arXiv.org Artificial Intelligence

All languages are peculiar. Some of them are considered more challenging to understand than others. The Finnish language is known to be complex. Also, when languages are used by artists, the pronunciation and meaning can be trickier to understand. Therefore, we are putting AI to a fun, yet challenging trial: transcribing a Finnish rap song to text. We will compare the Faster Whisper algorithm and YouTube's internal speech-to-text functionality. The reference truth will be the Finnish rap lyrics, which the main author's little brother, Mc Timo, has written. Transcribing the lyrics will be challenging because the artist raps over synth music played by Syntikka Janne. The hallucination level and mishearing of the AI speech-to-text extractions will be measured by comparing the errors made against the original Finnish lyrics. The error function is informal but still works for our case.
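
The comparison presumably relies on the faster-whisper library; a minimal sketch of obtaining a Finnish transcript with it is shown below. The model size, device settings, and audio file name are placeholders, not details from the paper.

from faster_whisper import WhisperModel

# Model size, device, and the audio file name are placeholders.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("pantterinousut.mp3", language="fi")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

hypothesis = " ".join(segment.text.strip() for segment in segments)
print(hypothesis)  # compare against the reference lyrics to count errors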


SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Alex, Tony, Ahmed, Sara, Mustafa, Armin, Awais, Muhammad, Jackson, Philip JB

arXiv.org Artificial Intelligence

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the self-supervised pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio self-supervised learning (SSL) methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets, which are predominantly monophonic, and conduct a comprehensive comparative analysis against state-of-the-art (SOTA) methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. These results demonstrate SSLAM's effectiveness in both polyphonic and monophonic soundscapes, significantly enhancing the performance of audio SSL models. Code and pre-trained models are available at https://github.com/ta012/SSLAM.
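
SSLAM's actual mixing policy is defined in the paper and the released code; as a generic illustration of constructing a polyphonic training input, the NumPy sketch below mixes two mono clips at a random signal-to-noise ratio. All names and the SNR range are assumptions, not values from the paper.

import numpy as np

def mix_at_snr(source: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix two mono waveforms so `source` sits at the requested SNR over `noise`.

    Generic polyphonic-mixture sketch; SSLAM's mixing strategy is not
    reproduced here.
    """
    n = min(len(source), len(noise))
    source, noise = source[:n], noise[:n]
    source_power = np.mean(source ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(source_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(source_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = source + scale * noise
    return mixture / (np.max(np.abs(mixture)) + 1e-12)  # peak-normalize

# Hypothetical usage with two random one-second "clips" at 16 kHz:
rng = np.random.default_rng(0)
clip_a, clip_b = rng.standard_normal(16000), rng.standard_normal(16000)
polyphonic = mix_at_snr(clip_a, clip_b, snr_db=rng.uniform(-5, 5))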